Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding an option for --disable-nodegroup-eviction when deleting a cluster #4659

Conversation

AmitBenAmi
Copy link
Contributor

@AmitBenAmi AmitBenAmi commented Jan 19, 2022

Description

When deleting a cluster that has pods with PDBs, I can not delete the whole cluster, since the nodes won't evict all of the workloads, causing the node groups draining step to fail.
Today, there is a flag when draining node groups of: --disable-eviction, which ignores PDB errors while draining the node groups.

I've added a flag: --disable-nodegroup-eviction, that will exist when deleting a cluster, and achieves the same behavior of the drain command with the --disable-eviction flag (it literally calls the drain function with the same value).

fixes #4416

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@cPu1 cPu1 added the kind/feature New feature or request label Jan 19, 2022
@@ -159,7 +163,7 @@ func checkForUndeletedStacks(stackManager manager.StackManager) error {
return nil
}

func drainAllNodegroups(cfg *api.ClusterConfig, ctl *eks.ClusterProvider, stackManager manager.StackManager, clientSet kubernetes.Interface, allStacks []manager.NodeGroupStack) error {
func drainAllNodeGroups(cfg *api.ClusterConfig, ctl *eks.ClusterProvider, clientSet kubernetes.Interface, allStacks []manager.NodeGroupStack, disableEviction *bool, nodeGroupDrainer NodeGroupDrainer, vpcCniDeleter VpcCniDeleter) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to make disableEviction a pointer type.

Suggested change
func drainAllNodeGroups(cfg *api.ClusterConfig, ctl *eks.ClusterProvider, clientSet kubernetes.Interface, allStacks []manager.NodeGroupStack, disableEviction *bool, nodeGroupDrainer NodeGroupDrainer, vpcCniDeleter VpcCniDeleter) error {
func drainAllNodeGroups(cfg *api.ClusterConfig, ctl *eks.ClusterProvider, clientSet kubernetes.Interface, allStacks []manager.NodeGroupStack, disableEviction bool, nodeGroupDrainer NodeGroupDrainer, vpcCniDeleter VpcCniDeleter) error {

Copy link
Contributor Author

@AmitBenAmi AmitBenAmi Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

})
})
_, err := cmd.execute()
Expect(err).To(Not(HaveOccurred()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
Expect(err).To(Not(HaveOccurred()))
Expect(err).NotTo(HaveOccurred())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@AmitBenAmi AmitBenAmi force-pushed the disabling-nodes-eviction-when-deleting-cluster-with-force-flag branch from 4ed383e to b340701 Compare January 19, 2022 12:41
@aclevername
Copy link
Contributor

Thanks for opening the PR @AmitBenAmi, descriptions makes sense 👌 will take a look. +10000 for the lovely tests 😄

type NodeGroupDrainer interface {
Drain(nodeGroups []eks.KubeNodeGroup, plan bool, maxGracePeriod, nodeDrainWaitPeriod time.Duration, undo, disableEviction bool) error
}
type VpcCniDeleter func(clusterName string, ctl *eks.ClusterProvider, clientSet kubernetes.Interface)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to export this.

Suggested change
type VpcCniDeleter func(clusterName string, ctl *eks.ClusterProvider, clientSet kubernetes.Interface)
type vpcCNIDeleter func(clusterName string, ctl *eks.ClusterProvider, clientSet kubernetes.Interface)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@@ -0,0 +1,127 @@
package cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhmm does this test file actually get run when you run ginkgo -r pkg/action/cluster/? I would of thought it needs to be in cluster_test suite to register. But I get that it needs to have access to the private function drainAllNodeGroups.

I'd prefer if we made drainAllNodeGroups public, and moved this file into cluster_test package.

what do you think 😄 ?

Copy link
Contributor Author

@AmitBenAmi AmitBenAmi Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my initial problem, which caused me to move it to a different package.
I didn't want to make the drainAllNodeGroups public so I moved it to be in the same package.
Do you think it is a good thing to make it publicly?

Also, when I run ginkgo -r ./pkg/action/cluster it does run these tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is a good thing to make it publicly?

I prefer it to testing private functions by including the test package in the code package. But some folks might disagree 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by including the test package in the code package?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He means, that he prefers making the function public, and then testing it, rather than having the test code in the same package as the code it self, aka, the test code is in cluster package rather than cluster_test package which makes it a separate package. The benefit of having cluster_test is that you test the public API of the package. If a private function can't be tested through that, it means that it would need a bit of refactoring maybe or it's not working encapsulating, etc.

This isn't a dogma of course. :) It's a suggestion. Try and test private functions through utilising the public function. If it's hard to test, maybe there are certain conditions which are difficult to achieve in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well summarised @Skarlso . Does that make sense @AmitBenAmi ? Its not a blocker on merging as its a personal preference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aclevername @Skarlso I changed it to fit this desire.
Basically, I also agree, and I wasn't happy with setting the test file inside the same package.
I synthetically exported the private function inside the export_test.go file, so I could test it outside of the cluster package, and now the tests file is in its own cluster_test package

@AmitBenAmi AmitBenAmi force-pushed the disabling-nodes-eviction-when-deleting-cluster-with-force-flag branch 2 times, most recently from 0247f14 to 63b5a8d Compare January 21, 2022 13:02
nodeGroupManager := c.newNodeGroupManager(c.cfg, c.ctl, clientSet)
if err := drainAllNodeGroups(c.cfg, c.ctl, clientSet, allStacks, disableNodegroupEviction, nodeGroupManager, attemptVpcCniDeletion); err != nil {
if force {
logger.Warning("error occurred during nodegroups draining")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Warning("error occurred during nodegroups draining")
logger.Warning("error %q occurred during nodegroups draining, force=true so proceeding with deletion", err.Error())

probably find better wording 😆 but the improvement makes sense 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

})

err := c.Delete(time.Microsecond, false, false, false)
Expect(err).To(HaveOccurred())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we improve this by changing to

Suggested change
Expect(err).To(HaveOccurred())
Expect(err).To(MatchError("whatever error I expected"))

or if we don't know the full error: Expect(err).To(MatchError(ContainSubstring("whatever sub-error I expected")))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in some of the other test 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to match the mocked exception message

@aclevername aclevername requested a review from cPu1 January 24, 2022 13:07
@AmitBenAmi AmitBenAmi force-pushed the disabling-nodes-eviction-when-deleting-cluster-with-force-flag branch 2 times, most recently from 3d5356b to fdac5d6 Compare January 24, 2022 19:48
nodeGroupManager := nodegroup.New(cfg, ctl, clientSet)
if err := nodeGroupManager.Drain(cmdutils.ToKubeNodeGroups(cfg), false, ctl.Provider.WaitTimeout(), 0, false, false); err != nil {

if err := nodeGroupDrainer.Drain(cmdutils.ToKubeNodeGroups(cfg), false, ctl.Provider.WaitTimeout(), 0, false, disableEviction); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, but Drain has now quite a lot of parameters. Would be nice to have them in a struct rather, where we can clearly read which 0 and which false, false belongs to which parameter. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 116 to 118
if force {
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())
} else {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if force {
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())
} else {
return err
}
if !force {
return err
}
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a preference. :) I rather skip elses to ease reading the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 91 to 93
if force {
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())
} else {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if force {
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())
} else {
return err
}
if !force {
return err
}
logger.Warning("an error occurred during nodegroups draining, force=true so proceeding with deletion: %q", err.Error())

It's a bummer that this has to be repeated. :/ ¯\_(ツ)_/¯

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@Skarlso Skarlso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this is fantastic work. :) I have like two nits, and the fact that it has to be duplicated for owned and unowned. I wished that wouldn't be the case, but nothing you can do about that right now. :)

@AmitBenAmi AmitBenAmi force-pushed the disabling-nodes-eviction-when-deleting-cluster-with-force-flag branch from df47c12 to 5e9d7a9 Compare January 30, 2022 08:06
@AmitBenAmi
Copy link
Contributor Author

Hey @aclevername ,
What are the next steps in order to merge this PR?

@Himangini
Copy link
Collaborator

Hey @aclevername , What are the next steps in order to merge this PR?

@AmitBenAmi we'll merge this as soon as the build is green :) thanks for all the hard work 🎉

…ce-flag' of github.com:AmitBenAmi/eksctl into disabling-nodes-eviction-when-deleting-cluster-with-force-flag
auto-merge was automatically disabled January 31, 2022 12:48

Head branch was pushed to by a user without write access

@AmitBenAmi
Copy link
Contributor Author

@Himangini Thanks, I pushed another commit to fix the lint errors

@Skarlso Skarlso enabled auto-merge (squash) January 31, 2022 13:53
@Skarlso Skarlso merged commit 8de7b96 into eksctl-io:main Jan 31, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Jan 31, 2022

Finally. :D :D Well done! Nice job. Thank you for your contribution @AmitBenAmi.

@AmitBenAmi
Copy link
Contributor Author

@Skarlso Thanks for the help and feedback, it was a pleasure 😄

@AmitBenAmi AmitBenAmi deleted the disabling-nodes-eviction-when-deleting-cluster-with-force-flag branch January 31, 2022 16:31
@hspencer77 hspencer77 mentioned this pull request Jul 8, 2022
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster deletion failed because of PodDisruptionBudget
5 participants